


(SJ\SA-CB- 13^312) OPITISAL DESIGN? OE hU 1374-28054 

DNSOPEEVISEC ACAF^IIVE CLASSIFIES ail’H 
OUKISICWN PBIC2S (Eice Oaiv,) 32 p 

EC $4.75 CSC! 1 2A UEclas 

G3/19 43436 ^ 



INSTITUTE FOR COMPUTER SERVICES AND APPLICATIONS 


RICE UNIVERSITY 


275-025-013 


Optimal Design of an Unsupervised 
Adaptive Classifier 
with 

Unknown Priors 
by 

Demetrios Kazakos 
ICSA 

Rice University 


ABSTRACT 

An adaptive detection scheme for M hypotheses is analyzed. We assume 
that the probability density function under each hypothesis is known, and 
that the prior probabilities of the M hypotheses are unknown and sequen- 
tially estimated. Each observation vector is classified using the current 
estimate of the prior probabilities. Using a set of nonlinear transforma- 
tions, and applying stochastic approximation theory, we design an optimally 
converging adaptive detection and estimation scheme. 

The optimality of the scheme lies in the fact that convergence to the true 
prior probabilities is ensured, and that the asymptotic error variance is 
minimum, for the class of nonlinear transformations considered. 

We also obtain an expression for the asymptotic error variance of our
scheme.

I. Introduction 


In general, there are three approaches to nonsupervised learning 
classifiers-detectors. 

The first approach is the method of mixtures, in which we estimate the 
unknown parameters of a mixture distribution of input patterns. [ 1 ] 

It has been found, in general, that learning algorithms of this type are 
not simple to implement, and that they are slow to converge, even though 
convergence to the true parameters can be guaranteed under certain loose 
restrictions [2]. 

The second approach is the method of constructing a discriminant function 
by an iterative procedure. It is simpler, but usually it does not lead to 
optimal classification. One of the main techniques applicable to this 
approach is clustering, which has been investigated in [ 3 ] , [ 4 ] , and 
elsewhere. 

The third approach is the decision-directed method. It is a straightforward 
application of supervised learning methods, hence it is simple. Scudder 
[6], Agrawala [5], and Davisson and Schwartz [7] have discussed some 
learning algorithms based on this approach. The disadvantage of the 
method is that the estimates are usually asymptotically biased, due to 
classification errors. 

In the present paper, we will use an improved version of the decision- 
directed approach. 

A decision-directed detector (classifier) uses previous decisions to 
estimate unknown parameters. On the basis of these estimates, the 
detector structure is modified for subsequent decisions. The fundamental 
idea is that the detector assumes all past decisions to be correct, and on 
this basis it tries to improve its performance by adjusting its decision 
parameters. Applying this idea, Scudder [6] considered the binary 
detection of an unknown signal versus no signal in noise. He estimated the 
signal as the sample mean of the observations that were classified as 
containing the signal. Convergence of his estimate was argued 
heuristically, and the estimate was asymptotically biased. 

Davisson and Schwartz [7] studied a decision-directed detector using 
previous decisions to estimate the prior probabilities as relative 
frequencies of decisions in favor of each hypothesis. Their estimate is 
asymptotically biased, but for certain applications the bias is 
sufficiently small for practical purposes. 

They found bounds on the probability that the estimate of the prior will 
"run away" to 0 or 1. 

In an unpublished study [8], the author has found similar bounds on the 
"run away" probability for a decision-directed receiver where both the 
signal amplitude and the prior probability are sequentially updated, using 
previous decisions. 

In order to have optimal performance of an M-ary detection scheme, 
accurate knowledge of the prior probabilities is necessary. 

The degradation in the probability of error when incorrect prior 
probabilities are used has been computed in closed form. 

In the present paper, the method of Davisson and Schwartz [7] for 
simultaneous detection and sequential estimation of the prior 
probabilities is generalized to arbitrary probability density functions 
and substantially improved by the introduction of nonlinear 
transformations. 

The prior probability updating procedure is made to converge to the true 
values, in an optimal fashion, in the sense that the asymptotic error 
variance is minimized. 

A stochastic approximation theorem due to Sacks [9] is invoked. 

Possibly the most interesting conclusion of the present work is the fact 
that we can improve the behavior of a sequential estimation procedure by 
using a memoryless nonlinear transformation. 

II. Binary detection with inaccurate prior probability 

We now consider the problem of binary detection with inaccurate knowledge 
of the prior probability. 

Let $H_1$, $H_2$ be the two hypotheses, with corresponding prior 
probabilities $\pi$ and $1 - \pi$, and probability density functions 
$f_1(x)$ and $f_2(x)$. 

The observation is $x$, $x \in E^n$. We pose the following convenient 
restriction on the p.d.f.'s: 

$$\forall x \in E^n: \quad f_1(x) > 0, \quad f_2(x) > 0, \quad \text{and both are continuous.}$$

It is well known that the optimal decision rule that minimizes the average 
probability of error is : 

Decide $H_1$ if $\pi f_1(x) \ge (1 - \pi) f_2(x)$. 

Decide $H_2$ otherwise. 

Hence, the knowledge of $\pi$ is essential for optimality. 

Let $P_e(p)$ be the average probability of error when $p$ is the estimate 
of $\pi$ used in the decision rule. 

It is then easy to show that : 

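$$P_e(p) - P_e(\pi) = \int_{R(\pi, p)} \left| \pi f_1(x) - (1 - \pi) f_2(x) \right| dx \;\ge\; 0$$

where 

$$R(\pi, p) = \left\{ x : \min\left( \frac{1-\pi}{\pi}, \frac{1-p}{p} \right) \le \frac{f_1(x)}{f_2(x)} \le \max\left( \frac{1-\pi}{\pi}, \frac{1-p}{p} \right) \right\}$$
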


The above formula gives in compact form the suboptimality of a 
decision rule that uses an incorrect estimate, $p$, of the true prior, $\pi$. 
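
As a hedged numerical illustration of this excess-error formula (not part of the original report), the sketch below assumes a scalar Gaussian pair, $f_1 = N(0, 1)$ and $f_2 = N(\mu, 1)$ with $\mu > 0$. For that assumed pair the likelihood-ratio rule reduces to a threshold on $x$, so both error probabilities have closed forms. The function names and the Gaussian choice are illustrative assumptions only.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def threshold(p, mu):
    """Threshold t(p): H1 is decided when x <= t(p).

    For f1 = N(0, 1) and f2 = N(mu, 1), mu > 0 (an assumed example),
    f1(x)/f2(x) >= (1 - p)/p is equivalent to x <= mu/2 - ln((1-p)/p)/mu.
    """
    return mu / 2.0 - math.log((1.0 - p) / p) / mu


def error_probability(p, prior, mu):
    """Average error when the rule uses p but the true prior of H1 is `prior`."""
    t = threshold(p, mu)
    miss_h1 = 1.0 - phi(t)   # H1 true, x > t, so H2 is decided
    miss_h2 = phi(t - mu)    # H2 true, x <= t, so H1 is decided
    return prior * miss_h1 + (1.0 - prior) * miss_h2


def excess_error(p, prior, mu):
    """Pe(p) - Pe(pi): the penalty for using p instead of the true prior."""
    return error_probability(p, prior, mu) - error_probability(prior, prior, mu)


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.6):
        print(p, excess_error(p, prior=0.3, mu=2.0))
```

Under these assumptions the penalty vanishes at $p = \pi$ and is positive on either side of it, which is the behavior the formula predicts.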
III. Adaptive detection-estimation scheme for 2 hypotheses 

We now assume that $f_1(x)$, $f_2(x)$ are known and positive for all 
$x \in E^n$, and that $\pi$ is unknown. 

The method of simultaneous detection and estimation of $\pi$ employed in 
[7] is the following. 

Assume that $n$ past observations $x_1, \ldots, x_n$ have been classified, 
$n_1$ of them to $H_1$ and $n_2$ of them to $H_2$. A natural estimate of 
$\pi$ is then 

$$p_n = \frac{n_1}{n}$$

When observation $x_{n+1}$ is received, its classification is then based 
on $p_n$: 

Decide $x_{n+1} \in H_1$ if 

$$\frac{f_1(x_{n+1})}{f_2(x_{n+1})} \ge \frac{1 - p_n}{p_n}$$

Decide $x_{n+1} \in H_2$ otherwise. 

$p_n$ can be expressed as: 

$$p_n = \frac{1}{n} \sum_{j=1}^{n} w_j$$

where 

$$w_{k+1} = \begin{cases} 1 & \text{if } \dfrac{f_1(x_{k+1})}{f_2(x_{k+1})} \ge \dfrac{1 - p_k}{p_k} \\ 0 & \text{otherwise.} \end{cases}$$



Written in a recursive estimation form: 

$$p_{n+1} = p_n - \frac{1}{n+1} \left[ p_n - w_{n+1} \right]$$

The expected value of $p_{n+1}$ conditioned on $p_n$ is given by: 

$$E(p_{n+1} \mid p_n) = p_n - \frac{1}{n+1} \left[ p_n - E(w_{n+1} \mid p_n) \right]$$

From the above form, it is clear that if $p_n$ converges to a value $q$, 
then this value will satisfy the equation: 

$$q = E(w_{n+1} \mid p_n = q)$$

or 

$$\pi \int_{R(q)} \left[ f_1(x) - f_2(x) \right] dx + \int_{R(q)} f_2(x)\, dx - q = 0$$

where 

$$R(q) = \left\{ x : \frac{f_1(x)}{f_2(x)} \ge \frac{1 - q}{q} \right\}$$

In general, the root of this equation is not equal to $\pi$, and therefore 
the procedure leads to an asymptotically biased estimate of $\pi$. 
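
The bias can be made concrete with a small numerical sketch. It is an illustration under assumptions that are not in the original: scalar densities $f_1 = N(0, 1)$ and $f_2 = N(\mu, 1)$ with $\mu > 0$, and a bisection search on a bracket chosen by hand. The helper names are hypothetical. The code solves $q = E(w_{n+1} \mid p_n = q)$ for the interior limiting value $q$ and compares it with the true prior.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mean_decision(q, prior, mu):
    """E(w | p = q): probability that an observation is assigned to H1.

    Assumes f1 = N(0, 1), f2 = N(mu, 1), mu > 0, so R(q) = {x <= t(q)}.
    """
    t = mu / 2.0 - math.log((1.0 - q) / q) / mu
    return prior * phi(t) + (1.0 - prior) * phi(t - mu)


def limiting_estimate(prior, mu, lo=0.05, hi=0.5, iterations=60):
    """Bisection root of mean_decision(q) - q = 0 inside [lo, hi]."""
    def h(q):
        return mean_decision(q, prior, mu) - q

    if h(lo) * h(hi) > 0.0:
        raise ValueError("bracket [lo, hi] does not contain a sign change")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if h(lo) * h(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    q = limiting_estimate(prior=0.3, mu=2.0)
    print("true prior 0.3, limiting value of the estimate:", q)
```

For the assumed values ($\pi = 0.3$, $\mu = 2$) the root lies well below $0.3$, so the relative-frequency estimate settles on a biased value in this example.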

We now introduce a modified sequential detection-estimation scheme, 
in order to improve on the original. 

Let $L(x)$, $g(x)$ be two nonlinear functions defined for $x \in [0, 1]$. 
Then, the following estimation algorithm is proposed: 

$$p_{n+1} = p_n - \frac{1}{n+1} L(p_n) \left[ g(p_n) - w_{n+1} \right]$$

The new regression function of the modified algorithm is: 

$$M(p_n) = E\left\{ L(p_n) \left[ g(p_n) - w_{n+1} \right] \mid p_n \right\} = L(p_n) \left\{ g(p_n) - \int_{R(p_n)} \left[ \pi f_1(x) + (1 - \pi) f_2(x) \right] dx \right\}$$

A necessary condition for having an asymptotically unbiased estimate is 
to have the value $p_n = \pi$ as a root of the regression equation: 

$$M(\pi) = 0$$

This condition is achieved by choosing the function $g$ as follows: 

$$g(s) = \int_{R(s)} \left[ s f_1(x) + (1 - s) f_2(x) \right] dx$$

for $s \in [0, 1]$. 

Substituting $g$ into the previous equation, we have: 

$$M(p_n) = L(p_n) \cdot (p_n - \pi) \cdot G(p_n)$$

where 

$$G(s) = \int_{R(s)} \left[ f_1(x) - f_2(x) \right] dx$$

The function $G(s)$ is monotone increasing for $s \in [0, 0.5]$ and 
monotone decreasing for $s \in [0.5, 1]$, as shown in Appendix I. 
Also, $G(0) = G(1) = 0$, and $G(s) > 0$ for $s \in (0, 1)$. 
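
A minimal sketch of $g(s)$ and $G(s)$ follows, for an assumed scalar Gaussian pair $f_1 = N(0, 1)$, $f_2 = N(\mu, 1)$, $\mu > 0$, which is not a case worked in the original. With that assumption $R(s)$ is a half-line and both functions reduce to normal distribution values. The names `region_masses`, `g` and `G` are illustrative; the code also lets one check the identity $g(p) - E(w \mid p) = (p - \pi) G(p)$ used above.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def region_masses(s, mu):
    """Probability of R(s) under f1 = N(0, 1) and under f2 = N(mu, 1).

    R(s) = {x : f1(x)/f2(x) >= (1 - s)/s} = {x <= t(s)} for mu > 0.
    Valid for 0 < s < 1.
    """
    t = mu / 2.0 - math.log((1.0 - s) / s) / mu
    return phi(t), phi(t - mu)


def g(s, mu):
    """g(s) = integral over R(s) of s*f1 + (1 - s)*f2."""
    s1, s2 = region_masses(s, mu)
    return s * s1 + (1.0 - s) * s2


def G(s, mu):
    """G(s) = integral over R(s) of f1 - f2."""
    s1, s2 = region_masses(s, mu)
    return s1 - s2


if __name__ == "__main__":
    for s in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(s, g(s, 2.0), G(s, 2.0))
```

In this assumed example $G$ rises up to $s = 0.5$ and falls afterwards, in line with the shape described in the text.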

The form of the function $G(s)$ is given in Fig. 1. 

[Fig. 1: $G(s)$ plotted against $s$] 

For reasons to be explained, we assume that we have the knowledge 
that the unknown prior $\pi$ lies between $\pi_1$ and $\pi_2$, where 

$$0 < \pi_1 < 0.5, \qquad 0.5 < \pi_2 < 1.$$

Let 

$$Z(p) = L(p) \left[ g(p) - W \right] - M(p)$$

for $p \in I_2$, where 

$$I_2 = [\pi_1, \pi_2]$$

and 

$$W = \begin{cases} 1 & \text{if } \dfrac{f_1(x)}{f_2(x)} \ge \dfrac{1 - p}{p} \\ 0 & \text{otherwise.} \end{cases}$$

Then, the Robbins-Monro stochastic approximation procedure that gives 
the sequential estimates of $\pi$ is written: 

$$p_{n+1} = p_n - (n+1)^{-1} \left[ Z(p_n) + M(p_n) \right]$$

We now invoke a theorem due to Sacks [9], in a slightly modified 
version, to fit the circumstances. 

The assumptions to be checked are: 

Assumption (1) 

$$(x - \pi) M(x) > 0 \quad \text{for all } x \in I_2, \; x \ne \pi$$

Assumption (2) 

For all $x \in I_2$ and some positive constant $K_1$, 

$$|M(x)| \le K_1 |x - \pi|$$

and for every $t_1$, $t_2$ with $0 < t_1 < t_2 < \infty$, 

$$\inf |M(x)| > 0$$

where the inf is taken for $x \in I_2$ and $t_1 \le |x - \pi| \le t_2$. 

Assumption (3) 

For all $x \in I_2$, 

$$M(x) = \alpha_1 (x - \pi) + \delta(x, \pi)$$

where $\delta(x, \pi) = o(|x - \pi|)$ as $x - \pi \to 0$, 
and where $\alpha_1 > 0$. 

Assumption (4) 

(a) $\sup_{x \in I_2} E\, Z^2(x) < \infty$ 

(b) $\lim_{x \to \pi} E\, Z^2(x) = \sigma^2$ 

Assumption (5) 

$\{ Z(x) \}$ are identically distributed random variables (conditioned 
on $x$). 

Assume, further, that $\alpha_1 > \tfrac{1}{2}$. 

Then, the stochastic approximation procedure 

$$p'_{n+1} = p_n - (n+1)^{-1} \cdot L(p_n) \cdot \left[ g(p_n) - W_{n+1} \right]$$

$$p_{n+1} = \begin{cases} \pi_1 & \text{if } p'_{n+1} \le \pi_1 \\ p'_{n+1} & \text{if } \pi_1 < p'_{n+1} < \pi_2 \\ \pi_2 & \text{if } \pi_2 \le p'_{n+1} \end{cases}$$

converges to $\pi$, and the error $n^{1/2}(p_n - \pi)$ is asymptotically 
normally distributed with mean 0 and variance $\sigma^2 (2\alpha_1 - 1)^{-1}$. 
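
The truncated recursion can be sketched in a few lines. This is a simulation under assumptions that are not taken from the report: scalar densities $f_1 = N(0, 1)$ and $f_2 = N(\mu, 1)$ with $\mu > 0$, a constant gain in place of the function $L(p)$, and a fixed random seed. The function name and its parameters are hypothetical.

```python
import math
import random


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def run_estimator(prior, mu, gain, lo, hi, steps, seed):
    """Simulate the truncated decision-directed estimator of the prior.

    Assumptions (illustrative only): f1 = N(0, 1), f2 = N(mu, 1), mu > 0,
    and a constant gain standing in for L(p). [lo, hi] plays the role of I2.
    """
    rng = random.Random(seed)
    p = 0.5 * (lo + hi)
    for n in range(1, steps + 1):
        # Draw the true class, then the observation from its density.
        from_h1 = rng.random() < prior
        x = rng.gauss(0.0 if from_h1 else mu, 1.0)
        # Decide H1 when f1(x)/f2(x) >= (1 - p)/p, i.e. when x <= t(p).
        t = mu / 2.0 - math.log((1.0 - p) / p) / mu
        w = 1.0 if x <= t else 0.0
        # g(p) removes the bias that classification errors would introduce.
        g = p * phi(t) + (1.0 - p) * phi(t - mu)
        p_raw = p - gain * (g - w) / (n + 1)
        # Truncation keeps the estimate inside [lo, hi].
        p = min(max(p_raw, lo), hi)
    return p


if __name__ == "__main__":
    estimate = run_estimator(prior=0.3, mu=2.0, gain=2.0,
                             lo=0.05, hi=0.95, steps=20000, seed=0)
    print("estimate of the prior:", estimate)
```

With these assumed settings the estimate should end close to the true prior $0.3$, in contrast with the relative-frequency estimate, whose limit is biased.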
Note: 

If the condition $\alpha_1 > \tfrac{1}{2}$ is not satisfied, the 
convergence of the error variance is slower than $n^{-1}$, as Sakrison [10] 
points out. 

For our particular case, 

$$M(p_n) = L(p_n) \cdot (p_n - \pi) \cdot G(p_n)$$

hence, if we restrict $L(x) > 0$ for $x \in I_2$, 
assumption (1) is satisfied. 

Also, if we further restrict $L(x)$ to be bounded, assumption (2) is 
easily satisfied. 

That is, the condition: 

$$0 < K_2 < L(x) < K_3, \quad \text{for } x \in I_2,$$

satisfies assumptions (1) and (2). Assumption (3) is satisfied if 
we let 

$$\alpha_1 = M'(\pi).$$

For our case, if we assume that $G'(x)$ exists, we have: 

$$M'(x) = L'(x) (x - \pi) \cdot G(x) + L(x) \cdot G(x) + L(x) \cdot G'(x)$$

$$M'(\pi) = L(\pi) \left[ G(\pi) + G'(\pi) \right]$$

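Also, 

$$\sigma^2 = \lim_{z \to \pi} E\left\{ \left[ L(z) \left( g(z) - W \right) - M(z) \right]^2 \mid z \right\}$$

After some straightforward calculations, we find: 

$$\sigma^2 = L^2(\pi)\, g(\pi) \left( 1 - g(\pi) \right)$$

The formula for the asymptotic error variance then becomes: 

$$\operatorname{var}\left[ n^{1/2} (p_n - \pi) \right] = \frac{L^2(\pi)\, g(\pi) \left( 1 - g(\pi) \right)}{2 L(\pi) \left[ G(\pi) + G'(\pi) \right] - 1}$$
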

In Fig. 2, we plot the function $G(\pi) + G'(\pi)$. 

For $0 \le \pi < \pi_0$ the function is positive, and for $\pi \ge \pi_0$ 
it is negative. 

The crossover point $\pi_0$ lies between 0.5 and 1. 

Therefore, if $\pi_2 < \pi_0$, we are ensured that 
$G(\pi) + G'(\pi) > 0$ for all $\pi \in I_2$. 

We are now in a position to find the optimal function $L(\pi)$ that will 
minimize the variance expression. It is a straightforward matter to see 
that the optimum function $L$ is: 

$$L(\pi) = \left[ G(\pi) + G'(\pi) \right]^{-1}, \quad \pi \in I_2$$

Using the above $L$, the achievable minimum asymptotic error variance is 

$$V = \frac{g(\pi) \left( 1 - g(\pi) \right)}{\left[ G(\pi) + G'(\pi) \right]^2}, \quad \pi \in I_2$$
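
The minimization step can be checked numerically. The sketch below is an illustration, not the author's computation: it treats the variance as $V(L) = L^2 c / (2 L a - 1)$ on $L > (2a)^{-1}$, where `slope` stands for the quantity written $G(\pi) + G'(\pi)$ in the text and `noise` for $g(\pi)(1 - g(\pi))$, and scans a grid of gains. Setting $dV/dL = 0$ gives $L = 1/a$ and $V = c / a^2$, and the grid search should agree. Both parameter names and the numerical values are assumptions.

```python
def asymptotic_variance(gain, slope, noise):
    """V(L) = L^2 * c / (2 * L * a - 1), defined only for 2 * L * a > 1."""
    denominator = 2.0 * gain * slope - 1.0
    if denominator <= 0.0:
        raise ValueError("gain too small: need 2 * gain * slope > 1")
    return gain * gain * noise / denominator


def best_gain_on_grid(slope, noise, points=2000):
    """Grid search for the gain minimizing V(L) on (1/(2a), 5/(2a)]."""
    lower = 1.0 / (2.0 * slope)
    candidates = [lower * (1.0 + 4.0 * (i + 1) / points) for i in range(points)]
    return min(candidates, key=lambda L: asymptotic_variance(L, slope, noise))


if __name__ == "__main__":
    slope, noise = 0.8, 0.2   # hypothetical values of a and c
    best = best_gain_on_grid(slope, noise)
    print("grid minimizer:", best, "closed form 1/a:", 1.0 / slope)
    print("minimum variance:", asymptotic_variance(best, slope, noise),
          "closed form c/a^2:", noise / slope ** 2)
```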

In Appendix I, we prove that $G(z)$ has the form of Fig. 1 for 
general density functions. 

In Appendix II, it is shown how we can easily compute $g(\pi)$, $G(\pi)$, 
$G'(\pi)$ for the multivariate Gaussian case. 

IV. Adaptive M-ary detection-estimation scheme with unknown priors 

In the present section, we consider the problem of adaptive detection 
with unknown prior probabilities under $M$ hypotheses, $M > 2$. The 
hypotheses $H_1, \ldots, H_M$ correspond to prior probabilities 
$\pi_1, \ldots, \pi_M$ and probability density functions 
$f_1(x), f_2(x), \ldots, f_M(x)$. 

For simplicity, we assume again that the $f_j(x)$ are positive and 
continuous for all $x \in E^n$. 


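We need to estimate $M - 1$ of the prior probabilities only, since 
$\pi_M = 1 - \sum_{i=1}^{M-1} \pi_i$. 

Let $p_n^j$, $j = 1, \ldots, M$, be the $n$-th estimate of $\pi_j$. 

Let 

$$\pi = \left[ \pi_1 \; \pi_2 \; \cdots \; \pi_{M-1} \right]^T$$

and 

$$P_n = \left[ p_n^1 \; p_n^2 \; \cdots \; p_n^{M-1} \right]^T$$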

Then, the observation $x_{n+1}$ will be classified according to the 
decision rule: 

Decide $H_k$ if $\; p_n^k f_k(x_{n+1}) = \max_m \, p_n^m f_m(x_{n+1})$. 

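
The M-ary decision rule amounts to an argmax over the weighted densities. The sketch below illustrates it with three assumed scalar Gaussian densities; the helper names, the zero-based class indices, and the tie-breaking choice (lowest index wins) are assumptions made for the example.

```python
import math


def gaussian_pdf(mean, std=1.0):
    """Return the density function of N(mean, std^2)."""
    def pdf(x):
        z = (x - mean) / std
        return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return pdf


def classify(x, prior_estimates, densities):
    """Index k (zero-based) maximizing p_k * f_k(x); ties go to the lowest index."""
    scores = [p * f(x) for p, f in zip(prior_estimates, densities)]
    return max(range(len(scores)), key=scores.__getitem__)


def decision_indicators(x, prior_estimates, densities):
    """Indicator list with a 1 in the position of the chosen class."""
    k = classify(x, prior_estimates, densities)
    return [1 if j == k else 0 for j in range(len(densities))]


if __name__ == "__main__":
    densities = [gaussian_pdf(0.0), gaussian_pdf(3.0), gaussian_pdf(6.0)]
    print(classify(1.6, [1 / 3, 1 / 3, 1 / 3], densities))
    print(classify(1.6, [0.98, 0.01, 0.01], densities))
```

The same observation can be assigned to different classes as the prior estimates change, which is why the quality of the estimates matters for the error rate.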
Again, to achieve optimality in our decision rule, we must find a method 
of forcing the vector $P_n$ to converge to the true value $\pi$ "as fast 
as possible." 

As before, a natural estimate of $\pi$ is the vector of relative 
frequencies of decisions in favor of the several classes. 

Let $n$ be the number of past observations, and let $n_k$, 
$k = 1, \ldots, M$, be the number of those classified as belonging to 
$H_k$: 

$$n = n_1 + n_2 + \cdots + n_M$$

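Then, a natural estimate of $\pi_k$ is: 

$$p_n^k = \frac{1}{n} \sum_{s=1}^{n} W_s^k$$

where 

$$W_{s+1}^k = \begin{cases} 1 & \text{if } p_s^k f_k(x_{s+1}) = \max_m \, p_s^m f_m(x_{s+1}) \\ 0 & \text{otherwise,} \end{cases} \qquad k = 1, 2, \ldots, M.$$

We have the following sequential estimation algorithm: 

$$p_{n+1}^k = p_n^k - (n+1)^{-1} \left( p_n^k - W_{n+1}^k \right), \qquad k = 1, \ldots, M.$$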


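Written in a vector form: 

$$\begin{bmatrix} p_{n+1}^1 \\ \vdots \\ p_{n+1}^{M-1} \end{bmatrix} = \begin{bmatrix} p_n^1 \\ \vdots \\ p_n^{M-1} \end{bmatrix} - (n+1)^{-1} \begin{bmatrix} p_n^1 - W_{n+1}^1 \\ \vdots \\ p_n^{M-1} - W_{n+1}^{M-1} \end{bmatrix}$$
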

This algorithm has the same fault as the original one in one dimension: 
it is asymptotically biased. 

The modification to be introduced now is a nontrivial generalization of the 
one-dimensional case. 

For reasons of convergence, we assume we have the knowledge that 
$\pi \in I_M$, where $I_M$ is a subset of $(0, 1)^{M-1}$ to be specified 
later. 

In the case $M = 2$, we have the previously defined interval 
$I_2 = [\pi_1, \pi_2]$. 

A desirable property that we will assign to the interval $I_M$ is: 

If $\pi \in I_M$, then $\pi_i \in [\ell_i, h_i]$, $i = 1, \ldots, M - 1$, 

where 

$$0 < \ell_i < h_i < 1$$

The $\ell_i$ are small positive numbers, and the $h_i$ are close to and 
below 1. We will assume, therefore, that we have knowledge of the region 
$I_M$, and hence $\ell_i$, $h_i$ are known. 
We define the saturation function: 

$$\operatorname{sat}[x, \ell, h] = \begin{cases} \ell & \text{if } x \le \ell \\ x & \text{if } x \in [\ell, h] \\ h & \text{if } x \ge h \end{cases}$$
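
The saturation function is a plain clamp. A minimal sketch follows; the component-wise helper `sat_vector` is an illustrative addition, not notation from the text.

```python
def sat(x, low, high):
    """Clamp x to the interval [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    if x <= low:
        return low
    if x >= high:
        return high
    return x


def sat_vector(values, lows, highs):
    """Apply sat component by component, one (low, high) pair per coordinate."""
    return [sat(x, lo, hi) for x, lo, hi in zip(values, lows, highs)]


if __name__ == "__main__":
    print(sat_vector([-0.2, 0.4, 1.3], [0.05, 0.05, 0.05], [0.95, 0.95, 0.95]))
```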


We now describe the modified sequential estimation procedure 
for the vector $\pi$: 

$$P'_{n+1} = P_n - (n+1)^{-1} \cdot A(P_n) \cdot L(P_n) \cdot \begin{bmatrix} g_1(P_n) - W_{n+1}^1 \\ g_2(P_n) - W_{n+1}^2 \\ \vdots \\ g_{M-1}(P_n) - W_{n+1}^{M-1} \end{bmatrix}$$

and 

$$p_{n+1}^j = \operatorname{sat}\left[ p'^{\,j}_{n+1}, \ell_j, h_j \right], \qquad j = 1, \ldots, M - 1$$

where 

$W_{n+1}^j$ are the previously defined random variables, 

$g_j(P_n)$ are scalar functions, defined for $P_n \in I_M$, 

$A(P_n)$ is a positive scalar function defined on $I_M$, and 

$L(P_n)$ is an $(M-1) \times (M-1)$ matrix 

$$L(P_n) = \left\{ L_{ij}(P_n) \right\}, \qquad i, j = 1, \ldots, M - 1$$

where $L_{ij}(P_n)$ are scalar functions defined on $I_M$. 

All of the above functions $L$, $g$, $A$ will be designed for improvement 
of the original algorithm. 

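We compute: 

$$E\left[ g_j(P_n) - W_{n+1}^j \mid P_n \right] = g_j(P_n) - \Pr\left[ p_n^j f_j(x) = \max_m \, p_n^m f_m(x) \right] = g_j(P_n) - \sum_{s=1}^{M} \pi_s \int_{R_j(P_n)} f_s(x)\, dx$$

where 

$$R_j(P_n) = \left\{ x : p_n^j f_j(x) = \max_m \, p_n^m f_m(x) \right\}$$
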

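If we let 

$$g_j(P_n) = \sum_{s=1}^{M} p_n^s \int_{R_j(P_n)} f_s(x)\, dx, \qquad j = 1, \ldots, M - 1$$

then 

$$E\left[ g_j(P_n) - W_{n+1}^j \mid P_n \right] = \sum_{s=1}^{M} \left( p_n^s - \pi_s \right) \int_{R_j(P_n)} f_s(x)\, dx$$

$$= \sum_{s=1}^{M-1} \left( p_n^s - \pi_s \right) \int_{R_j(P_n)} f_s(x)\, dx + \left( p_n^M - \pi_M \right) \int_{R_j(P_n)} f_M(x)\, dx$$

$$= \sum_{s=1}^{M-1} \left( p_n^s - \pi_s \right) \int_{R_j(P_n)} \left[ f_s(x) - f_M(x) \right] dx$$

The last step uses $p_n^M - \pi_M = -\sum_{s=1}^{M-1} (p_n^s - \pi_s)$, 
since both sets of probabilities sum to one. 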

We define the following $(M-1) \times (M-1)$ matrix $F(P_n)$, with 
elements 

$$F_{sj}(P_n) = \int_{R_j(P_n)} \left[ f_s(x) - f_M(x) \right] dx, \qquad s, j = 1, \ldots, M - 1$$

Then, the $(M - 1)$-dimensional regression function for the modified 
multidimensional algorithm is: 

$$M(P_n) = A(P_n) \cdot L(P_n) \cdot E\left\{ \left[ g_1(P_n) - W_{n+1}^1, \; \ldots, \; g_{M-1}(P_n) - W_{n+1}^{M-1} \right]^T \mid P_n \right\}$$

or 

$$M(P_n) = A(P_n) \cdot L(P_n) \cdot F(P_n) \cdot (P_n - \pi)$$

Hence: 

$$M(\pi) = 0$$

We note that by properly choosing the functions $g_j$ we have made the 
procedure asymptotically unbiased, assuming that it converges. We still 
have to ensure convergence. 

Let 

$$Z(P_n) = A(P_n)\, L(P_n) \left[ g_1(P_n) - W_{n+1}^1, \; \ldots, \; g_{M-1}(P_n) - W_{n+1}^{M-1} \right]^T - M(P_n)$$

defined for $P_n \in I_M$. 

To ensure convergence and asymptotic normality of the error, we now 
invoke the multidimensional version of Sacks' theorem [9]. 

The assumptions to be satisfied are: 

Assumption (1) 

$M(\pi) = 0$, and for every $\epsilon > 0$ 

$$\inf \, (x - \pi)^T M(x) > 0$$

where the inf is taken for $x \in I_M$ and 
$\epsilon^{-1} > \| x - \pi \| > \epsilon$. 

Assumption (2) 

There exists a positive constant $k_1$ such that: 

$$\| M(x) \| \le k_1 \| x - \pi \|$$

for all $x \in I_M$. 

Assumption (3) 

For all $x \in I_M$, 

$$M(x) = B \cdot (x - \pi) + \delta(x, \pi)$$

where $B$ is an $(M-1) \times (M-1)$ positive definite matrix, and 
$\| \delta(x, \pi) \| = o(\| x - \pi \|)$ as $\| x - \pi \| \to 0$. 

Assumption (4) 

$$\sup_{x \in I_M} E \| Z(x) \|^2 < \infty$$

$$\lim_{x \to \pi} E\, Z(x) Z^T(x) = S$$

where $S$ is a nonnegative definite matrix. 

Assumption (5) 

$\{ Z(x) \}$ are identically distributed (conditioned on $x$). 

Let $b_1 \ge \cdots \ge b_{M-1}$ be the eigenvalues of $B$ in decreasing 
order. 

Write $B = P D P^{-1}$, where $P$ is orthogonal and $D$ is the diagonal 
matrix with diagonal elements $(b_1, \ldots, b_{M-1})$. 

Let $s_{ij}$ be the $(i, j)$th element of $S$, and let $s^*_{ij}$ be the 
$(i, j)$th element of $S^* = P^{-1} S P$. 

Let $b_{M-1} > \tfrac{1}{2}$. 

Then, $n^{1/2}(P_n - \pi)$ is asymptotically normal, with zero mean and 
covariance matrix $P Q P^{-1}$, where $Q$ is the matrix whose 
$(i, j)$th element is $(b_i + b_j - 1)^{-1} s^*_{ij}$. 

We now assume that the matrix $F(P)$ is nonsingular for all $P \in I_M$. 

Satisfaction of the above condition depends on the form of the probability 
density functions $f_1(x), \ldots, f_M(x)$ and on the interval $I_M$. 

For reasons to be seen immediately, we choose: 

$$L(P) = F^T(P)$$

Also, we restrict the scalar function $A(P)$ to be bounded: 

$$0 < k_2 < A(P) < k_3, \qquad \forall P \in I_M$$

Hence, the inf of Assumption (1) is bounded from below: 

$$\inf \, (x - \pi)^T M(x) = \inf \, (x - \pi)^T A(x)\, F^T(x) F(x)\, (x - \pi) \ge k_2\, \epsilon^2 \inf_{y \in I_M} \rho^2(y)$$

where $\rho^2(y)$ is the minimum eigenvalue of $F^T(y) F(y)$. 

Because of the assumption that $F(y)$ is nonsingular, $\rho^2(y)$ is 
positive. Hence, Assumption (1) is satisfied. Assumptions (2) and (4) are 
easily shown to hold, and Assumption (3) also holds, with 

$$B = A(\pi) \cdot F^T(\pi) F(\pi)$$

We now need to compute the covariance matrix $S$. 

In Appendix III, we show that 

$$S(\pi) = A^2(\pi) \cdot F^T(\pi) \cdot R(\pi) \cdot F(\pi)$$

where $R(\pi)$ is an $(M-1) \times (M-1)$ matrix with elements 

$$R_{km}(\pi) = g_k(\pi) \left[ \delta_{km} - g_m(\pi) \right]$$

where 

$$\delta_{km} = \begin{cases} 1 & \text{for } k = m \\ 0 & \text{for } k \ne m \end{cases}$$

If $q_1, \ldots, q_{M-1}$ are the eigenvalues of $F^T F$ in decreasing 
magnitude, we can write 

$$F^T F = P \operatorname{diag}(q_1, \ldots, q_{M-1})\, P^{-1}$$

where $P$ is orthogonal. 

The eigenvalues of $B$ are then 

$$b_j = A q_j$$

and 

$$B = A\, P \operatorname{diag}(q_1, \ldots, q_{M-1})\, P^{-1}$$

$$S^* = A^2\, P^{-1} F^T R F P$$

Let $m_{ij}$ be the elements of the matrix $P^{-1} F^T R F P$. 

Then 

$$s^*_{ij} = A^2 m_{ij}$$

and $Q$ has elements 

$$Q_{ij} = A^2 \left( A q_i + A q_j - 1 \right)^{-1} m_{ij}$$


To conclude: 

For every $\pi \in I_M$, the matrices $F$, $P$, $R$ are fixed. The only 
parameter to be adjusted is $A(\pi)$. 

The restrictions $A(\pi)$ must satisfy are: 

$$0 < k_2 < A(\pi) < k_3$$

Furthermore, we must have 

$$A(\pi) \cdot \min_k q_k(\pi) > \tfrac{1}{2}$$

This last condition is essential in order to have mean square convergence 
of the error of the order $n^{-1}$. If it is not satisfied, as Sakrison 
[10] points out, convergence is slower than $n^{-1}$. 

The optimal choice of $A(\pi)$, hence, is the value that minimizes the 
trace of $P Q P^{-1}$, where $Q$ has elements 

$$Q_{ij} = A^2 \left( A q_i + A q_j - 1 \right)^{-1} m_{ij}$$

under the constraint: 

$$A \min_k q_k > \tfrac{1}{2}$$

We have 

$$T(A) = \operatorname{trace}(P Q P^{-1}) = \operatorname{trace}(Q P^{-1} P) = \operatorname{trace}(Q)$$

Hence, we wish to minimize 

$$T(A) = A^2 \sum_{k=1}^{M-1} \frac{m_{kk}}{2 A q_k - 1}$$

under the constraint: 

$$2 A q_k > 1 \qquad \text{for } k = 1, \ldots, M - 1$$

where 

$$m_{kk} \ge 0$$

and 

$$q_1 \ge q_2 \ge \cdots \ge q_{M-1} > 0$$

The function $T(A)$ is positive in the region 

$$A > (2 q_{M-1})^{-1}$$

For $A \to +\infty$, $T(A) \to +\infty$, 

and for $A \downarrow (2 q_{M-1})^{-1}$, $T(A) \to +\infty$. 

Since it is also a ratio of polynomials, it must have a number of local 
minima for $A > (2 q_{M-1})^{-1}$. 
The derivative of $T(A)$ is: 

$$T'(A) = 2A \sum_{k=1}^{M-1} m_{kk} \frac{A q_k - 1}{(2 A q_k - 1)^2}$$

For $A \ge (q_{M-1})^{-1}$, $T'(A) \ge 0$. 

Hence, the region of interest for seeking zeros of $T'(A)$ is 

$$I(q_{M-1}) = \left( (2 q_{M-1})^{-1}, \; (q_{M-1})^{-1} \right]$$

The number of zeros of $T'(A)$ is at most $2(M - 1)$. 

Let $A_i$, $i = 1, \ldots, 2(M-1)$, be the zeros of $T'(A)$ that are 
in $I(q_{M-1})$. 

Then, the optimal value of $A$ is: 

$$A_0(\pi) = \arg \left[ \min_i T(A_i) \right]$$
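
A numerical sketch of this search follows. It does not enumerate the zeros of $T'(A)$; as a simpler substitute, it scans a grid over the interval between $(2 q_{M-1})^{-1}$ and $(q_{M-1})^{-1}$ and keeps the grid point with the smallest $T(A)$. The eigenvalues `q` and diagonal terms `m` passed in are hypothetical inputs, and the function names are illustrative.

```python
def trace_objective(A, q, m):
    """T(A) = A^2 * sum_k m_kk / (2 * A * q_k - 1), valid when 2 * A * q_k > 1."""
    total = 0.0
    for q_k, m_kk in zip(q, m):
        denominator = 2.0 * A * q_k - 1.0
        if denominator <= 0.0:
            raise ValueError("constraint 2 * A * q_k > 1 is violated")
        total += m_kk / denominator
    return A * A * total


def optimal_gain(q, m, points=4000):
    """Grid minimizer of T(A) on ((2 * q_min)^-1, (q_min)^-1]."""
    q_min = min(q)
    lower = 1.0 / (2.0 * q_min)
    upper = 1.0 / q_min
    grid = [lower + (upper - lower) * (i + 1) / points for i in range(points)]
    return min(grid, key=lambda A: trace_objective(A, q, m))


if __name__ == "__main__":
    # Hypothetical eigenvalues of F^T F and diagonal terms m_kk.
    q = [2.0, 1.0]
    m = [1.0, 1.0]
    A0 = optimal_gain(q, m)
    print("A0 =", A0, "T(A0) =", trace_objective(A0, q, m))
```

With a single eigenvalue the minimizer reduces to $A = 1/q$, which matches the one-dimensional result $L = [G + G']^{-1}$ in form. Repeating the scan over a mesh of values of $\pi$ would tabulate the nonlinearity $A_0(\pi)$.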

The above procedure was done for a fixed $\pi \in I_M$. 

Doing the same thing for a mesh in $I_M$, we can construct the optimal 
nonlinearity $A_0(\pi)$, $\pi \in I_M$. 

Therefore, the trace of the error covariance matrix has the asymptotic 
minimum value: 

$$\operatorname{trace} \operatorname{cov}\left[ n^{1/2} (P_n - \pi) \right] \to A_0^2(\pi) \sum_{k=1}^{M-1} m_{kk}(\pi) \left[ 2 A_0(\pi)\, q_k(\pi) - 1 \right]^{-1}$$

V. Conclusions 


It has been shown that we can estimate the prior probabilities 
efficiently, even in the presence of detection errors. The cost we have to 
pay is the construction of the above nonlinear functions of $\pi$. 

For the binary case and for multivariate Gaussian probability density 
functions, the nonlinear transformation functions are easy to construct. 
The important conclusion is that the use of nonlinear transformations 
can improve the properties of stochastic approximation methods. 





APPENDIX I 


For the two hypotheses and for general probability density functions, the 
function 

$$G(z) = \int_{\frac{f_1(x)}{f_2(x)} \ge \frac{1-z}{z}} \left[ f_1(x) - f_2(x) \right] dx$$

has the form given in Fig. 1. 

Proof: 

a) For $z \in [0, 0.5]$, on the region of integration 

$$\frac{f_1(x)}{f_2(x)} \ge (1 - z)\, z^{-1} \ge 1$$

and $(1 - z)\, z^{-1}$ is monotone decreasing, hence $G(z)$ is monotone 
increasing. Also, $G(0) = 0$. 

b) For $z \in [0.5, 1]$, $z^{-1}(1 - z) \le 1$ and 

$$G(z) = \int_{\frac{f_1(x)}{f_2(x)} \ge 1} \left[ f_1(x) - f_2(x) \right] dx - \int_{1 \ge \frac{f_1(x)}{f_2(x)} \ge \frac{1-z}{z}} \left[ f_2(x) - f_1(x) \right] dx$$

The second integral is a positive, monotone increasing function of $z$, 
hence $G(z)$ is monotone decreasing for $z \in [0.5, 1]$. 

Also, $G(1) = 0$. 

Hence, $G(z)$ has a maximum at $z = 0.5$ and has the form given 
in Fig. 1. 
APPENDIX II 


For the two hypotheses case and for Gaussian n-dimensional probability 
density functions, the nonlinearities $g(\pi)$, $G(\pi)$, $L(\pi)$ can be 
constructed by using a method due to Fukunaga and Krile [11]. 

The essential characteristic of the method is the linear transformation 
of the observation vector to a new one that has components statistically 
independent under both hypotheses. 

We are interested in computing the following two integrals: 

$$S_1(\pi) = \int_{\frac{f_1(x)}{f_2(x)} \ge \frac{1-\pi}{\pi}} f_1(x)\, dx$$

$$S_2(\pi) = \int_{\frac{f_1(x)}{f_2(x)} \ge \frac{1-\pi}{\pi}} f_2(x)\, dx$$

for $\pi \in I_2$. 

Let $f_1(x) = N(x, 0, R_1)$ and 
$f_2(x) = N(x, M, R_2)$, 

where $M = M_2 - M_1$ is the difference of the mean vectors. 

Let $A$ be the $n \times n$ matrix satisfying the relations: 

$$A R_1 A^T = I$$

$$A R_2 A^T = \Lambda$$

where $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$, with 
$\lambda_1, \ldots, \lambda_n$ the eigenvalues satisfying: 

$$\left| R_2 - \lambda R_1 \right| = 0$$

For $R_1$, $R_2$ positive definite, all $\lambda_j$ are positive. 

Let $u = A M = (u_1, \ldots, u_n)^T$. 

If $x$ is the original observation vector, the new one is 
$Y = A x = (y_1, \ldots, y_n)^T$. 

The statistics of $Y$ are: 

$$E(Y \mid H_1) = 0, \qquad E(Y \mid H_2) = u$$

$$E(Y Y^T \mid H_1) = I, \qquad E\left[ (Y - u)(Y - u)^T \mid H_2 \right] = \Lambda$$

We can see easily, then, that 

$$S_1(\pi) = \Pr\left[ U(Y) < 0 \mid H_1 \right]$$

where 

$$U(Y) = \sum_{j=1}^{n} \left\{ y_j^2 - \frac{1}{\lambda_j} (y_j - u_j)^2 - \ln \lambda_j \right\} - 2 \ln\left[ \pi (1 - \pi)^{-1} \right]$$

Since under $H_1$ the $y_j$ are independent Gaussian random variables with 
zero mean and unit variance, the characteristic function of $U(Y)$ can be 
easily computed: 

$$M_1(j\omega) = E\left\{ \exp\left( j\omega \cdot U(Y) \right) \mid H_1 \right\} = K(j\omega) \cdot \prod_{m=1}^{n} F_{1m}(j\omega)$$
where 



the vector Z has zero mean and unit variance Gaussian independent 
components. 

Then, 

$$S_2(\pi) = \Pr\left[ V(Z) < 0 \mid H_2 \right]$$

where 




The characteristic function of $V(Z)$ is then: 

$$M_2(j\omega) = K(j\omega) \cdot \prod_{m=1}^{n} F_{2m}(j\omega)$$

where the $F_{2m}(j\omega)$ have the same form as the $F_{1m}(j\omega)$, 
with corresponding parameters: 

^2m “ ^ 

^2m “ (^m ^ ) 

'’ 2 m " ■ (®2m *’2m ) ®2m ) ''' ^"'m 

Hence, $M_1(j\omega)$ and $M_2(j\omega)$ can be easily expressed in terms 
of $u$ and $\Lambda$. 

Then, the functions $S_k(\pi)$, $k = 1, 2$, can be expressed as an integral 
involving $M_k(j\omega)$, as follows: 

$$S_k(\pi) = 2^{-1} - p^{-1} \int_0^{+\infty} \omega^{-1} \operatorname{Im}\left[ M_k(j\omega) \right] d\omega, \qquad k = 1, 2$$

where $p = 3.14159\ldots$ 

The functions $G(\pi)$, $g(\pi)$ can then be expressed as: 

$$G(\pi) = S_1(\pi) - S_2(\pi)$$

$$g(\pi) = \pi S_1(\pi) + (1 - \pi) S_2(\pi)$$

In the formula for $M_k(j\omega)$, $\pi$ appears only in the first factor. 

Hence, the computation of $S_k'(\pi)$, and therefore of $G'(\pi)$, involves 
one more integration, using $K'(j\omega)$ instead of $K(j\omega)$. 
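APPENDIX III 


The matrix $R(\pi)$ has elements 

$$R_{km}(\pi) = E\left\{ \left( g_k(\pi) - W_{n+1}^k \right) \left( g_m(\pi) - W_{n+1}^m \right) \mid \pi \right\} = -g_k(\pi)\, E\left( W_{n+1}^m \mid \pi \right) + E\left( W_{n+1}^k W_{n+1}^m \mid \pi \right)$$

We have shown that 

$$E\left( W_{n+1}^k \mid \pi \right) = g_k(\pi)$$

Also, 

$$W_{n+1}^k W_{n+1}^m = \begin{cases} W_{n+1}^k & \text{for } k = m \\ 0 & \text{for } k \ne m \end{cases}$$

Therefore, 

$$R_{km}(\pi) = g_k(\pi) \left[ \delta_{km} - g_m(\pi) \right]$$

where 

$$\delta_{km} = \begin{cases} 1 & \text{for } k = m \\ 0 & \text{for } k \ne m \end{cases}$$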